Mini Project 2: Neural Network with Nonlinear Least Squares

Name: Fei Yin

PID: A15555426

In this project, we approximate two nonlinear equations with a neural network by solving nonlinear least squares problems. Along the way, we explore various initial conditions. Finally, we evaluate the performance of the approximations.

We are given a function: $\hat{y} = f_W(x) = w_1\phi(w_2x_1 + w_3x_2 + w_4x_3 + w_5) + w_6\phi(w_7x_1 + w_8x_2 + w_9x_3 + w_{10}) + w_{11}\phi(w_{12}x_1 + w_{13}x_2 + w_{14}x_3 + w_{15}) + w_{16}$, where $\phi(x)$ is the sigmoid (hyperbolic tangent) function $\frac{e^x - e^{-x}}{e^x + e^{-x}}$, the $x$'s are data points, and the $w$'s are the coefficients we want to find, which allow us to approximate various equations.
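Note that this $\phi$ is exactly the hyperbolic tangent, so it can be evaluated with NumPy's built-in `np.tanh`. A quick numerical check (a sketch; the name `phi` is ours):

```python
import numpy as np

# phi as given: (e^x - e^{-x}) / (e^x + e^{-x}), which is exactly tanh(x)
def phi(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

xs = np.linspace(-3, 3, 7)
print(np.allclose(phi(xs), np.tanh(xs)))  # True
```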

(1) $\nabla_W{f_W(x)}$ is the column vector of partial derivatives of $f_W(x)$ with respect to each of $w_1, \dots, w_{16}$:

$$\nabla_W{f_W(x)} = \begin{bmatrix} \frac{\partial{f_W(x)}}{\partial{w_1}}\\ \frac{\partial{f_W(x)}}{\partial{w_2}}\\ \frac{\partial{f_W(x)}}{\partial{w_3}}\\ \frac{\partial{f_W(x)}}{\partial{w_4}}\\ \frac{\partial{f_W(x)}}{\partial{w_5}}\\ \frac{\partial{f_W(x)}}{\partial{w_6}}\\ \frac{\partial{f_W(x)}}{\partial{w_7}}\\ \frac{\partial{f_W(x)}}{\partial{w_8}}\\ \frac{\partial{f_W(x)}}{\partial{w_9}}\\ \frac{\partial{f_W(x)}}{\partial{w_{10}}}\\ \frac{\partial{f_W(x)}}{\partial{w_{11}}}\\ \frac{\partial{f_W(x)}}{\partial{w_{12}}}\\ \frac{\partial{f_W(x)}}{\partial{w_{13}}}\\ \frac{\partial{f_W(x)}}{\partial{w_{14}}}\\ \frac{\partial{f_W(x)}}{\partial{w_{15}}}\\ \frac{\partial{f_W(x)}}{\partial{w_{16}}}\\ \end{bmatrix} = \begin{bmatrix} \phi(w_2x_1 + w_3x_2 + w_4x_3 + w_5)\\ w_1\phi'(w_2x_1 + w_3x_2 + w_4x_3 + w_5)x_1\\ w_1\phi'(w_2x_1 + w_3x_2 + w_4x_3 + w_5)x_2\\ w_1\phi'(w_2x_1 + w_3x_2 + w_4x_3 + w_5)x_3\\ w_1\phi'(w_2x_1 + w_3x_2 + w_4x_3 + w_5)\\ \phi(w_7x_1 + w_8x_2 + w_9x_3 + w_{10})\\ w_6\phi'(w_7x_1 + w_8x_2 + w_9x_3 + w_{10})x_1\\ w_6\phi'(w_7x_1 + w_8x_2 + w_9x_3 + w_{10})x_2\\ w_6\phi'(w_7x_1 + w_8x_2 + w_9x_3 + w_{10})x_3\\ w_6\phi'(w_7x_1 + w_8x_2 + w_9x_3 + w_{10})\\ \phi(w_{12}x_1 + w_{13}x_2 + w_{14}x_3 + w_{15})\\ w_{11}\phi'(w_{12}x_1 + w_{13}x_2 + w_{14}x_3 + w_{15})x_1\\ w_{11}\phi'(w_{12}x_1 + w_{13}x_2 + w_{14}x_3 + w_{15})x_2\\ w_{11}\phi'(w_{12}x_1 + w_{13}x_2 + w_{14}x_3 + w_{15})x_3\\ w_{11}\phi'(w_{12}x_1 + w_{13}x_2 + w_{14}x_3 + w_{15})\\ 1\\ \end{bmatrix}$$

(2) Given $r_n(W) = f_W(x^{(n)}) - y^{(n)}$ and $r(W) = [r_1(W), \dots, r_N(W)]^T$, $Dr(W)$ is the $N \times 16$ Jacobian matrix whose $n$-th row is $\nabla_W{f_W(x^{(n)})}^T$. Since $y^{(n)}$ does not depend on $W$, it does not appear in $Dr(W)$.

$$Dr(W) = \begin{bmatrix} \frac{\partial{f_W(x^{(1)})}}{\partial{w_1}} & \dots & \frac{\partial{f_W(x^{(1)})}}{\partial{w_{16}}}\\ \vdots & & \vdots\\ \frac{\partial{f_W(x^{(N)})}}{\partial{w_1}} & \dots & \frac{\partial{f_W(x^{(N)})}}{\partial{w_{16}}}\\ \end{bmatrix}$$

(3) We want to minimize the following training loss function with respect to $W$:

$$l(W) = \sum_{n=1}^{N}{r_n}^2(W) + \lambda{\|W\|_2}^2$$

Using the Levenberg-Marquardt Algorithm, we want to solve the following:

$$ \begin{split} \min_{W} l(W) + \lambda'{\|W-W^k\|_2}^2 & = \min_{W} \sum_{n=1}^{N}{r_n}^2(W) + \lambda{\|W\|_2}^2 + \lambda'{\|W-W^k\|_2}^2\\ & = \min_{W} {\|r(W)\|_2}^2 + \lambda{\|W\|_2}^2 + \lambda'{\|W-W^k\|_2}^2\\ & \approx \min_{W} {\|\hat{r}(W;W^k)\|_2}^2 + \lambda{\|W\|_2}^2 + \lambda'{\|W-W^k\|_2}^2\\ & = \min_{W} {\|r(W^k) + Dr(W^k)(W - W^k)\|_2}^2 + \lambda{\|W\|_2}^2 + \lambda'{\|W-W^k\|_2}^2\\ \end{split} $$

At this step, we can formulate the problem as a linear least squares problem and obtain its solution by solving its corresponding normal equation.

Let: $$ \begin{split} A_1 = Dr(W^k) & \;\;\;\; & b_1 = Dr(W^k)W^k - r(W^k)\\ A_2 = I & \;\;\;\; & b_2 = 0\\ A_3 = I & \;\;\;\; & b_3 = W^k \end{split} $$

$$ \begin{split} A = \begin{bmatrix} A_1\\ \sqrt{\lambda}A_2\\ \sqrt{\lambda'}A_3 \end{bmatrix} \;\;\;\; b = \begin{bmatrix} b_1\\ \sqrt{\lambda}b_2\\ \sqrt{\lambda'}b_3 \end{bmatrix} \end{split} $$

And the previous equation becomes: $$ \begin{split} \min_{W} {\|r(W^k) + Dr(W^k)(W - W^k)\|_2}^2 + \lambda{\|W\|_2}^2 + \lambda'{\|W-W^k\|_2}^2 & = \min_{W} {\|AW - b\|_2}^2\\ A^T A W & = A^T b\\ (Dr(W^k)^T Dr(W^k) + \lambda I + \lambda' I)W & = Dr(W^k)^T (Dr(W^k)W^k - r(W^k)) + \lambda' W^k\\ & = (Dr(W^k)^T Dr(W^k) + \lambda' I)W^k - Dr(W^k)^T r(W^k) \end{split} $$

$$ W = (Dr(W^k)^T Dr(W^k) + \lambda I + \lambda' I)^{-1} ((Dr(W^k)^T Dr(W^k) + \lambda' I)W^k - Dr(W^k)^T r(W^k)) $$
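In code, each iterate can be computed by solving this linear system directly rather than forming the inverse. A minimal sketch (the name `lm_step` is ours; `Dr` and `r` are assumed to be precomputed NumPy arrays):

```python
import numpy as np

def lm_step(W_k, Dr, r, lamb, lamb_d):
    # One Levenberg-Marquardt update: solve
    # (Dr^T Dr + (lambda + lambda') I) W = (Dr^T Dr + lambda' I) W_k - Dr^T r
    DtD = Dr.T @ Dr
    I = np.eye(len(W_k))
    lhs = DtD + (lamb + lamb_d) * I
    rhs = (DtD + lamb_d * I) @ W_k - Dr.T @ r
    return np.linalg.solve(lhs, rhs)
```

With $\lambda = \lambda' = 0$ this reduces to a plain Gauss-Newton step, and as $\lambda' \to \infty$ the update shrinks toward $W^k$, which is the usual damping behavior of Levenberg-Marquardt.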

The above $W$ represents $W^{k+1}$ given $W^k$ as input; of course, the data $x$, the labels $y$, $\lambda$, and $\lambda'$ are also needed. To realize the Levenberg-Marquardt Algorithm, we first define several basic functions below: norm, sigmoid, and sigmoid_d. Note that sigmoid_d computes the derivative of the sigmoid function, given by: $$ \phi'(x) = \frac{4}{e^{2x} + e^{-2x} + 2}$$
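These helpers might be sketched as follows (the bodies are our assumptions; only the names norm, sigmoid, and sigmoid_d come from the text). The last line verifies the closed-form derivative against the identity $\phi'(x) = 1 - \phi(x)^2$:

```python
import numpy as np

def norm(v):
    # Euclidean (2-)norm of a vector
    return np.sqrt(np.sum(np.square(v)))

def sigmoid(x):
    # phi(x) = (e^x - e^{-x}) / (e^x + e^{-x}) = tanh(x)
    return np.tanh(x)

def sigmoid_d(x):
    # phi'(x) = 4 / (e^{2x} + e^{-2x} + 2)
    return 4.0 / (np.exp(2 * x) + np.exp(-2 * x) + 2.0)

x = np.linspace(-2, 2, 9)
print(np.allclose(sigmoid_d(x), 1 - sigmoid(x) ** 2))  # True
```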

Next, we define functions calculate_f and calculate_r to realize $f_W(x)$ and $r(W)$ respectively.
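A possible sketch of these two functions (the exact signatures are assumptions; `X` is an $N \times 3$ array of data vectors):

```python
import numpy as np

def calculate_f(w, x):
    # f_W(x): three tanh units plus a bias, following the formula above
    return (w[0] * np.tanh(w[1]*x[0] + w[2]*x[1] + w[3]*x[2] + w[4])
            + w[5] * np.tanh(w[6]*x[0] + w[7]*x[1] + w[8]*x[2] + w[9])
            + w[10] * np.tanh(w[11]*x[0] + w[12]*x[1] + w[13]*x[2] + w[14])
            + w[15])

def calculate_r(w, X, y):
    # r(W): vector of residuals r_n = f_W(x^(n)) - y^(n); X has shape (N, 3)
    return np.array([calculate_f(w, x) for x in X]) - y
```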

Using sigmoid and sigmoid_d, we can also define the gradient of $f_W(x)$ with respect to $W$, or $\nabla_W{f_W(x)}$, by implementing calculateGradientW. Then, using calculateGradientW, we can define the Jacobian of $r(W)$, or $Dr(W)$, by implementing calculateJacobianW.
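A sketch of both functions, with a finite-difference check of the gradient (the weight layout and the signatures are assumptions consistent with the formulas above; the helper `f` is local to the check):

```python
import numpy as np

def calculateGradientW(w, x):
    # Gradient of f_W(x) w.r.t. the 16 weights; weights for hidden unit i
    # sit at index 5i (output weight) and 5i+1..5i+4 (input weights, bias)
    g = np.empty(16)
    for i in range(3):
        a = 5 * i
        z = w[a+1]*x[0] + w[a+2]*x[1] + w[a+3]*x[2] + w[a+4]
        d = 4.0 / (np.exp(2 * z) + np.exp(-2 * z) + 2.0)  # phi'(z)
        g[a] = np.tanh(z)
        g[a+1:a+4] = w[a] * d * np.asarray(x)
        g[a+4] = w[a] * d
    g[15] = 1.0
    return g

def calculateJacobianW(w, X):
    # Dr(W): row n is grad f_W(x^(n))^T, so the matrix is N x 16
    return np.vstack([calculateGradientW(w, x) for x in X])

# Finite-difference sanity check of the gradient at a random point
def f(w, x):
    return sum(w[5*i] * np.tanh(w[5*i+1]*x[0] + w[5*i+2]*x[1]
                                + w[5*i+3]*x[2] + w[5*i+4])
               for i in range(3)) + w[15]

rng = np.random.default_rng(0)
w, x = rng.normal(size=16), rng.normal(size=3)
num = np.array([(f(w + 1e-6*np.eye(16)[j], x) - f(w - 1e-6*np.eye(16)[j], x)) / 2e-6
                for j in range(16)])
print(np.allclose(calculateGradientW(w, x), num, atol=1e-6))  # True
```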

With all the above functions, we now implement the Levenberg-Marquardt Algorithm with the function iterateLM. We apply 4 stopping criteria:

  1. Small residual (p. 393 of the textbook): stop when $\left\lVert r(W) \right\rVert^2$ is small enough (set to 0.001 in the applications)
  2. Small optimality condition residual (p. 393 of the textbook): stop when $\left\lVert 2 Dr(W)^T r(W) \right\rVert^2$ is small enough (set to 100 in the applications)
  3. Stop when the number of iterations grows too large (set to 1000 in the applications)
  4. (Optional) Stop when $\lambda'$ becomes much larger or much smaller than the input $\lambda'$
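Putting the pieces together, iterateLM might be organized as the sketch below (repeating the model helpers so the block is self-contained). The update solves the normal equation derived earlier; the accept/reject rule and the factors 0.8 and 2.0 used to adjust $\lambda'$ are assumptions (a common LM scheme), as are the exact thresholds used for criterion 4:

```python
import numpy as np

def calculate_f(w, x):
    return sum(w[5*i] * np.tanh(w[5*i+1]*x[0] + w[5*i+2]*x[1]
                                + w[5*i+3]*x[2] + w[5*i+4])
               for i in range(3)) + w[15]

def calculate_r(w, X, y):
    return np.array([calculate_f(w, x) for x in X]) - y

def calculateJacobianW(w, X):
    J = np.empty((len(X), 16))
    for n, x in enumerate(X):
        for i in range(3):
            a = 5 * i
            z = w[a+1]*x[0] + w[a+2]*x[1] + w[a+3]*x[2] + w[a+4]
            d = 1.0 - np.tanh(z) ** 2          # phi'(z)
            J[n, a] = np.tanh(z)
            J[n, a+1:a+4] = w[a] * d * x
            J[n, a+4] = w[a] * d
        J[n, 15] = 1.0
    return J

def iterateLM(W, X, y, lamb, lamb_d, max_iter=1000, prune_by_lamb_d=False):
    lamb_d0, losses, I = lamb_d, [], np.eye(16)
    for _ in range(max_iter):                   # criterion 3: iteration cap
        r = calculate_r(W, X, y)
        if r @ r < 1e-3:                        # criterion 1: small residual
            break
        Dr = calculateJacobianW(W, X)
        g = 2 * Dr.T @ r
        if g @ g < 100:                         # criterion 2: small optimality residual
            break
        DtD = Dr.T @ Dr
        W_new = np.linalg.solve(DtD + (lamb + lamb_d) * I,
                                (DtD + lamb_d * I) @ W - Dr.T @ r)
        r_new = calculate_r(W_new, X, y)
        old_loss = r @ r + lamb * (W @ W)
        new_loss = r_new @ r_new + lamb * (W_new @ W_new)
        if new_loss < old_loss:
            W, lamb_d = W_new, lamb_d * 0.8     # accept step, relax damping
            losses.append(new_loss)
        else:
            lamb_d *= 2.0                       # reject step, increase damping
        if prune_by_lamb_d and not (1e-6 < lamb_d / lamb_d0 < 1e6):
            break                               # criterion 4 (optional)
    return W, losses
```

Since a step is accepted only when it strictly decreases the loss, the recorded training losses are monotonically decreasing.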

Finally, the error metric used for the training and testing results is defined as: $$ \frac{1}{N} \sum_{n=1}^{N}{\left|\frac{y^{(n)} - f_W(x^{(n)})}{y^{(n)}}\right|} $$ which is implemented by the calculateError function below.
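calculateError might be sketched as (signature assumed; `y_hat` holds the model predictions $f_W(x^{(n)})$):

```python
import numpy as np

def calculateError(y, y_hat):
    # Mean relative error: (1/N) * sum |(y_n - y_hat_n) / y_n|
    return np.mean(np.abs((y - y_hat) / y))

print(calculateError(np.array([1.0, 2.0, 4.0]), np.array([1.1, 1.8, 4.0])))  # ≈ 0.0667
```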

Training $y = x_1 x_2 + x_3$

This is the first equation we want to model. We start by generating 500 random 3-vectors as our data vectors.

We then calculate y, the labels, with the generated x values.
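Data and label generation can be sketched as:

```python
import numpy as np

N = 500
X = np.random.randn(N, 3)           # 500 random 3-vectors with standard-normal entries
y = X[:, 0] * X[:, 1] + X[:, 2]     # labels: y = x1*x2 + x3
```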

In order to roughly pick the best initial $\lambda$ and $\lambda'$, we sweep over multiple combinations of the two parameters, run iterateLM 5 times for each, and take the average error over these 5 runs as the error for that pair of $\lambda$ and $\lambda'$. Note that the initial w values are randomly drawn with np.random.randn, the same function used to generate the data vectors x. prune_by_lamb_d is turned on here to save time.

We can plot the training loss from each run, grouped by its corresponding pair of $\lambda$ and $\lambda'$. The plots are shown below. Note: some plots are empty because their corresponding initial choices of $\lambda'$ cause the residual to never decrease, so no training loss was recorded. Some plots show only a few iterations because those runs are terminated early when prune_by_lamb_d is turned on.

After computing the average error produced by each pair of $\lambda$ and $\lambda'$, we construct a table showing all the errors with respect to each pair. From the table, we can decide which $\lambda$ and $\lambda'$ are roughly the best initial conditions for the function we are modeling.

Choosing the $\lambda$ and $\lambda'$ that resulted in the smallest error from the tests above, we then run iterateLM 10 times and pick the final set of coefficients, w, that results in the smallest error. prune_by_lamb_d is turned off here.

The plot below shows training loss vs iterations for the 10 runs. Notice that not all of the 10 runs give good results. Because we initialize w randomly, it is necessary that we run iterateLM multiple times with the same pair of $\lambda$ and $\lambda'$.

Testing $y = x_1 x_2 + x_3$

After training the coefficients that model this equation, we now test their performance on different sets of data with different numbers of data vectors. We create 3 new sets of data vectors of sizes 50, 150, and 300, with a corresponding set of labels for each. Note that the testing data vectors are created in the same way as the training ones, using np.random.randn.
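The test sets can be generated the same way as the training data, e.g.:

```python
import numpy as np

test_sets = []
for n in (50, 150, 300):
    X_test = np.random.randn(n, 3)                       # same distribution as training
    y_test = X_test[:, 0] * X_test[:, 1] + X_test[:, 2]  # labels for y = x1*x2 + x3
    test_sets.append((X_test, y_test))
```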

We then apply the trained coefficients to calculate each $f_W(x)$ and its error with respect to the corresponding labels.

The errors for 2 of the 3 test cases are slightly larger than the training error but remain mostly consistent with it, which means our model generalizes to the testing data reasonably well. The error metric (calculateError) used for the testing cases is the same as that used for training.

Training $y = x_1 (x_2 - x_3)$

This is the second equation we want to model. Except for the labels being computed differently, all of the code is identical to the code used for training the first equation. We again start by generating 500 random 3-vectors as our data vectors.

We then calculate y, the labels, with the generated x values.
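Only the label computation changes, e.g.:

```python
import numpy as np

X = np.random.randn(500, 3)          # data vectors, generated as before
y = X[:, 0] * (X[:, 1] - X[:, 2])    # labels: y = x1 * (x2 - x3)
```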

In order to roughly pick the best initial $\lambda$ and $\lambda'$, we sweep over multiple combinations of the two parameters, run iterateLM 5 times for each, and take the average error over these 5 runs as the error for that pair of $\lambda$ and $\lambda'$. Note that the initial w values are randomly drawn with np.random.randn, the same function used to generate the data vectors x. prune_by_lamb_d is turned on here to save time.

We can plot the training loss from each run, grouped by its corresponding pair of $\lambda$ and $\lambda'$. The plots are shown below. Note: some plots are empty because their corresponding initial choices of $\lambda'$ cause the residual to never decrease, so no training loss was recorded. Some plots show only a few iterations because those runs are terminated early when prune_by_lamb_d is turned on.

After computing the average error produced by each pair of $\lambda$ and $\lambda'$, we construct a table showing all the errors with respect to each pair. From the table, we can decide which $\lambda$ and $\lambda'$ are roughly the best initial conditions for the function we are modeling.

Choosing the $\lambda$ and $\lambda'$ that resulted in the smallest error from the tests above, we then run iterateLM 10 times and pick the final set of coefficients, w, that results in the smallest error. prune_by_lamb_d is turned off here.

The plot below shows training loss vs iterations for the 10 runs. Notice that not all of the 10 runs give good results. Because we initialize w randomly, it is necessary that we run iterateLM multiple times with the same pair of $\lambda$ and $\lambda'$.

Testing $y = x_1 (x_2 - x_3)$

After training the coefficients that model this equation, we now test their performance on different sets of data with different numbers of data vectors. We create 3 new sets of data vectors of sizes 50, 150, and 300, with a corresponding set of labels for each. Note that the testing data vectors are created in the same way as the training ones, using np.random.randn.

We then apply the trained coefficients to calculate each $f_W(x)$ and its error with respect to the corresponding labels.

The errors for all 3 test cases are slightly larger than the training error but remain mostly consistent with it, which means our model generalizes to the testing data reasonably well. The error metric (calculateError) used for the testing cases is the same as that used for training.

Summary

With the method of sweeping over multiple choices of $\lambda$ and $\lambda'$, I am able to choose the pair that results in the smallest error among the pairs I tested. Then, by running iterateLM 10 times with that pair and choosing the set of coefficients that results in the smallest error, I am able to determine a good choice of coefficients for the two functions above, as reflected by the training and testing errors. However, my method is clearly imperfect, because there are many choices of $\lambda$ and $\lambda'$ that I have not considered. Furthermore, my data points are constructed with np.random.randn, which draws values from a standard normal distribution (mean 0, variance 1). This certainly does not cover all possible inputs, so my model is only assured to work on a small portion of all possible data vectors. Lastly, through trial and error, I found that only a small selection of functions can be modeled by our neural network. The original function given in the assignment can be modeled because it is simple enough, but when I tried to model more complex functions, such as a sine, cosine, or exponential, the resulting errors were well over 100%. My guess is that, because our neural network has only a single hidden layer (with just three units), it can only model functions of limited complexity.